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^^ , We establish the fundamental limits of DNA shotgun sequencing under noisy reads. We show a surprising 
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result: for the i.i.d. DNA model, noisy reads are as good as noiseless reads, provided that the noise level is 
below a certain threshold which can be surprisingly high. As an example, for a uniformly distributed DNA 
sequence and a symmetric substitution noisy read channel, the threshold is as high as 19%. 

I. Introduction 

DNA sequencing is the basic workhorse of modern day biology and medicine. Since the sequencing 

• , of the Human Reference Genome ten years ago, there has been an explosive advance in sequencing 
O ■ technology. Multiple "next-generation" sequencing platforms have emerged. All of them are based on 

the whole-genome shotgun sequencing method. The basic shotgun DNA sequencing set-up is shown 
in Figured] Starting with a DNA molecule, the goal is to obtain the sequence of bases {A, C, G or T) 
comprising it. The sequencing machine extracts a large number of reads from the DNA; each read 

^ is a randomly located fragment of the DNA sequence. The DNA assembly problem is to reconstruct 

t^^ the DNA sequence from the many reads. 

• i A basic question, still largely open, is the following: given DNA sequence statistics and charac- 
Q ' teristics of the sequencing technology such as read length and noise statistics, how many reads are 
cn ' needed to reconstruct the original DNA sequence, if it is possible at all? The answer to this question 
^ . can provide an algorithm-independent basis for evaluating the efficiency of a sequencing technology 

K>- ' and can be used to compare different assembly algorithms. [7j provides an answer to this question 

KA' in a simple setting: 1) each read has the same length L bases and is uniformly and independently 

H ■ sampled from the length G DNA sequence; 2)the DNA sequence is modeled as an i.i.d. string; 3) 

- - - the read process is noiseless. The main result shows that in the asymptotic regime where L and 

G — > oo with L = L/logG fixed, a critical phenomenon occurs: when L < Lent-, reconstruction 

is impossible, and when L > L^^, then having enough reads to cover the DNA sequence is also 

sufficient for reconstruction. Here, Lent = '^/H^enyi, where -ffrenyi is the Renyi entropy rate of order 2. 

The significance of Lcrit is that with high probability, there are no repeats of length more than Lcrit 

in the DNA sequence. The coverage bound is a well-known lower bound introduced by Lander and 

Waterman \U\ in the early days of sequencing. Thus, the result says that as long as the read length 

is longer than the longest repeat in the DNA, this lower bound is asymptotically tight. 

In pp, the theory of noiseless assembly is extended to DNA sequences with arbitrary repeat 
statistics. In this paper, instead, we keep the i.i.d. DNA model but we consider noisy reads. 

The optimal assembly algorithm which achieves the fundamental limit in the above setting is the 
greedy algorithm. The greedy algorithm merges reads with the largest overlap first, where the overlap 
between two reads is the longest exact match between a prefix of one read and a suffix of another 
read. A natural extension of the greedy algorithm to the noisy read case is that instead of looking 
at exact matches, one allows approximate matches, where the degree of approximation tolerated is 
a function of the read noise statistics. The performance analysis of such an algorithm under noisy 
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Fig. 1. Schematic for shotgun sequencing. 

reads was considered in [7] . Not so surprisingly, noise always degrades the performance of the greedy 
algorithm, and in fact the effect is quite significant. 

The modification of the greedy algorithm is only one approach to deal with noise. But are there 
better approaches? What, in fact, is the fundamental limit on the system performance under noisy 
reads? We show a surprising result in this paper: provided that the noise level is below a certain 
threshold, noise has no impact on the asymptotic performance. The threshold on the noise level is 
given by the condition 

-'read ^ -nrenyi; \^ ) 

where 
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Here, Qs is the probability that a DNA base equals s and Ti{y\s) is the probability that a DNA base 
s is read as y through the noisy read channel. In particular, under the uniform distribution Q^ = 0.25 
for all s and symmetric read channel with probability of mis-read 6, i/renyi = 2 bits and /read is the 
capacity of the read channel: 

/read = -5l0g--(l-<5)l0g(l-<5). 

The condition ([T]) translates to a threshold of 6* = 0.19 for this example. As long as the noise level 
is below 19%, the noiseless performance can be achieved, i.e. coverage is sufficient when L > Lent = 

^ I J^ renyi ■ 

In communication, noise almost always has a detrimental effect on asymptotic performance, as it 
degrades the channel capacity. So we would like to give some intuition on why noise (below a certain 
level) has no impact on the asymptotic performance in the shotgun assembly problem considered 
here. First, we need to understand better the implication of the coverage condition. It follows from 
Lander- Waterman's results that the number of reads A'^cov needed to cover the entire DNA sequence 
with probability at least 1 — e is well approximated by: 

!■"(£ 

Thus, the coverage depth, i.e. the average number of reads covering each base, is given by: 
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For example, for G = 3 x 10^, L = 100, e = 0.05, the coverage depth c = 20. In the asymptotic hmit, 
the coverage depth goes to infinity. This high coverage depth provides a level of redundancy which 
can be exploited to deal with noisy reads: if multiple reads covering the same region of the DNA can 
be aligned together, then one can average over the symbols in the different noisy reads to obtained a 
cleaned-up read. However, if the read length is too short, this alignment cannot be done accurately, 
since noisy reads from other similar-looking regions of the DNA will be mis-aligned together and this 
would not help the noise averaging process. This minimum read length for accurate alignment would 
depend both on the noise statistics and the repeat statistics of the DNA sequence. What we show is 
that for the i.i.d. DNA model and memoryless read noise, as long as the noise level is less than the 
threshold given by condition ([T]), then accurate alignment can be achieved provided that the read 
length is longer than the longest repeat on the DNA sequence. This is exactly the same condition on 
the read length needed for noiseless assembly. Hence, one essentially can achieve error correction for 
free. 

The scheme we propose to achieve the fundamental limit under noisy reads has two stages: an 
error- correction phase, which aligns reads from the same region of the DNA and averages across 
them to produce cleaner reads, followed by an assembly phase, applying the greedy algorithm with 
approximate match to the cleaner reads. Provided that the noise level satisfies condition ([T]) to allow 
accurate read alignment, the noise level of the reads can be driven to be vanishingly small after the 
error-correction phase. Since it was shown in [7] that the performance of the greedy algorithm is 
continuous in the noise level, this implies noiseless performance can be achieved asymptotically. 

In the assembly literature, there are two approaches to deal with the noise in the reads. In the 
first approach, error-correction is performed jointly with assembly such as Velvet [10] and ABySS 
[1] which are based on de Bruijn graph. In the second approach, error-correction is performed first, 
followed by an assembly algorithm which assumes the reads are essentially clean. Examples of the 
algorithms are SHREC [3], Reptile [U], and Quake [S]. The latter is a separation approach, which is 
conceptually simpler. What we show in this paper is that, at least for the simple model considered 
here, the separation approach is in fact information-theoretically optimal, up to a certain threshold 
on the noise level. 

II. Formulation and Previous Results 

A. DNA Model 

The DNA sequence s = S1S2 ■ ■ ■ sq is modeled as an i.i.d. random string of length G with each 
symbol taking values according to a probability distribution Q = {Qa, Qc: Qg, Qt) on the alphabet 
{A, C, G, T}. To avoid boundary effects, we assume that the DNA sequence is circular, i.e., Si = Sj if 
i = j mod G] this simplifies the exposition, and all results apply with appropriate minor modification 
to the non-circular case as well. 

B. Noiseless Reads 

A noiseless read is a substring of length L from the DNA sequence. The set of reads is denoted by 
TZ = {ri, r2, . . . , Fiv}. The starting location of read i is ti, so rj = s[tj, ti + L — 1]. The set of starting 
locations of the reads is denoted T = {ti, ^2, • • • , ^at}, where we assume I < ti < t2 < ■ ■ ■ < t^ < G. 
We assume that the starting location of each read is uniformly distributed on the DNA and the 
locations are independent from one read to another. 

An assembly algorithm takes a set of A^ reads 7?. = {ri, . . . , rjv} and returns an estimated sequence 
s = s(7^). We require perfect reconstruction, which presumes that the algorithm makes an error 
if s 7^ s. A question of central interest is: what are the conditions on the read length L and the 
number of reads A^ such that the reconstruction error probability is less than a given target e for 



some algorithm? Define the minimum normalized coverage depth c^i^{L): 

Cmin(L) = hm_ ( n T\ ' (2) 

G->oo,L=LlogG A/covte, G, L) 

where A^min(e, G, L) is the minimum number of reads required to reconstruct the DNA sequence with 
probabihty at least 1 — e and Ncov{^, G, L) is the minimum number of reads to cover the DNA sequence 
with probability at least 1 — e. 

The main result for this noiseless read model is: 

Theorem 1. /?/ Fix an e < 1/2. Cmin{L) is given by 

^ ' \l ifL>2/H,,^y,, ^ ^ 

where Hj-enyi is the Renyi entropy of order 2 defined as: 

i^renyi:=-log ^ QI (4) 

s€{A,C,G,T} 

C. Noisy Reads 

Now we assume that the read process is noisy and consider a simple probabilistic model for the 
noise. A base s G {A, C, G, T} is read to be y G 3^ for some ground set y with probability 7r(y|s). 
Each base is perturbed independently, i.e. if r = ri,. . . ,rL is a read from the physical underlying 
subsequence s = si, . . . , s^ of the DNA sequence, then 

L 

F{r\s) = l[7i{ri\si). 

Moreover, it is assumed that the noise affecting different reads is independent. 

In the noiseless read case, we aim for perfect reconstruction. In the noisy read case, we aim for 
perfect layout. By perfect layout, we mean that all the reads are mapped correctly to their true 
locations. Note that perfect layout does not imply perfect reconstruction as the consensus sequence 
may not be identical to the DNA sequence on every single base. On the other hand, since coverage 
implies that most positions on the DNA are covered by many reads (growing with G), the consensus 
sequence will be correct in most positions if we achieve perfect layout. 

By modifying the greedy algorithm to allow for approximate instead of exact matches, the following 
performance can be achieved. 

Theorem 2. /?/ The modified greedy algorithm can achieve normalized coverage depth c{L) = 1 if 
L > L^lit ^ ■ L^lit ^ is a continuous function of the DNA and noise statistics and is strictly larger than 
LcrH whenever the noise is non-trivial. 
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III. Optimal Error Correction 

Theorem [2] shows that the critical read length increases from that in the noiseless case when the 
modified greedy algorithm is directly applied on the noisy read data. What we show in this section is 
that, if the noise level is below a certain threshold, there is actually enough redundancy in the noisy 
reads to perform almost perfect error correction. By applying the modified greedy algorithm on the 
cleaned-up reads, noiseless performance can be achieved asymptotically. 

First, we define the quality of a cleaned-up read r of length K: 
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Fig. 2. Plot of I/^rit" ^i^) 8-8 B^ function of the noise level S for the uniform source and symmetric noise model. 



where the mininiization is over all length K subsequences of the DNA sequence s. Also, let r'(f) be 
the minimizing subsequence, i.e. the one on the DNA sequence with the closest match to f . 
The main result of the paper is the following theorem. 

Theorem 3. (Error Correction) Assume: L > Lcrit, the noisy reads cover the DNA sequence asymp- 
totically, and the noisy read channel satisfies conditionUl Then there is an error- correction algorithm 
which takes as inputs the N noisy reads ri, . . . , r^r and outputs N cleaned-up reads Fi, . . . , r^ such 
that: 

• Each cleaned-up read fj is of length K such that K/L — > 1. 

• There is a sequence {tg} with r^ — )■ such that: 

lim P(max (i(f i) > tg) = 0. 

G—>oo i 

• Coverage: r'(f i), . . . , r'(r^) cover the DNA sequence asymptotically. 

IV. Proof of Theorem O 

The proof technique is based on the method of types [2] and the slight modification of strong 
typicality, as defined in 0. Let X be discrete set of size \X\. The set of all possible probability 
distributions on X is denoted by V{X). The set of all possible emprical distributions (types) of 
sequences x G X^ is denoted by Vk{X)- Clearly, Vk{X) C V{X). The cardinality of Vk{X) is 
upper bounded by {K + l)!-^! [2]. 

We say a sequence x G X^ is typical wrt the probability distribution F, if 

.N{a\x.) 
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for all a & X. Here, A^(a|x) is the number of occurrences of a G A* in x. Similarly, one can define 
the joint typicality of a set of sequences X := {xi,X2, . . . ,x„} wrt the probability distribution F on 
X"'. The main property of this definition of joint typicality is that if a set X of sequences is jointly 
typical wrt F, any subset SS of the sequences is jointly typical wrt the marginal distribution of F 

on SS. 



A. The Error Correction Algorithm 

Let Vk{,{A, C, G, T}) denote the set of all possible types of sequences in {A, C, G, T}^. For given 
P G Vk, we denote Fp to be the distribution of observing M independent samples of a base through 
the noisy read channel with the base distribution P. Clearly, 



F^{y„...,yM):= E {U^iy^l^)] ^s). (5) 




We also denote Fp by Fp. 

Let TZ be the set of all reads. For a fixed K and for each read, we extract all the substrings of 
length K from that read where each substring is called a i^-mer. We create the pool U consisting of 
all the /r-mers. 

A set of M K-mers, U = {ui, . . . , Um} is said to be a good alignment if U is jointly typical wrt Fp^ 
for some P G Vk- In the definition of "good alignment", we have considered all possible empirical 
distributions over DNA bases instead of considering only Q which is the true distribution of DNA 
bases. The reason is that the DNA sequence is long and contains atypical sequences of length K wrt 
the true distribution Q. Therefore, if we only use Q to define "good alignment", we lose the coverage 
of the DNA sequence. 

For any good alignment U, we take, for each component i = 1, . . . ,K, the Maximum Likelihood 
(ML) estimate Sj of the underlying base Si from uu, . . . ,UMi, assuming they are independent obser- 
vations of Si through the read channel. The averaged K-mei (si, . . . , sk) is denoted by fu- Lastly, we 
create TZ, the list of all cleaned-up reads, consisting of the averaged i^-mers from all possible good 
alignments taken from U. 

B. Analysis of the Algorithm 

To analyze the proposed error correction algorithm, we set K = L — L" and M = (5 log L for some 
constant a and /3 G (0,0.5). 

1) Error Correction Condition: To show the error correction condition of Theorem [31 we define £ 
be the event that there is a good alignment from U such that the averaged sequence f has quality 
(i(f) > tq- We define the following events: 
£i{J) '■ There is a good alignment from U with J < M i^-mers from distinct locations of the DNA 

sequence. 
S2{Mi) : There is a good alignment from U with two subsets of i^-mers each of which coming from 

the same location and having size Mi < ^. 
SsIMq) : There are M K-mers in U with at least Mq i^-mers from a single location but whose averaged 
sequence is not within Hamming distance Ktg from the DNA subsequence at that location. 
We claim that £ C £^{J) U ^2(Mi) U SsiM - (Mi - 1)(J - 2)). This is due to the fact that if a 
good alignment has less than J i^-mers coming from distinct locations and has at most one subset 
of K-meTS of size Mi coming from the same location, then at least M — (Mi — 1)(J — 2) Ji'-mers 
come from a single location. 

Therefore, the union bound gives us 

nn < n^iiJ)) + P(^2(Mi)) + F{S,{M - (Ml - 1)(J - 2))). (6) 



In particular, we will set J = Mi + 2, Mi = Mi + 1, and hence M - (Mi - 1)(J - 2) = M - VM, 
i.e., we are interested in the case when the majority of the reads come from a single location in event 
£3. We will upper bound each of the terms in (jHD and show that they all go to zero as M — )■ cxd. 



a) F{Si{J)) —7- 0; The event Si{J) happens when there is a good ahgnment U containing J 
i^-mers from distinct locations. Without loss of generality, let us assume that the first J K-meis of 
U = {ui, . . . ,uj\/} are sampled from distinct locations. Since U is jointly typical wrt Fp^ for some 
P G Vk, {ui, . . . uj} is jointly typical wrt Fp. Considering the fact that there are at most G'^ possible 
choices for J distinct locations on the DNA sequence and there are only polynomially many types, 
we can apply large deviation arguments to obtain 

P(^i(^)) < {K + 1YG'^2-^('^'''^^^k Di^p\\ULi^Q)-^^(')) ^ (7) 

for some Si{e) > 0. We proceed by computing the KL-divergence D (Fp\\ n/=i-^Q)- Let S denote a 
DNA base distributed according to P and Yi, . . .Yj be independent observations of S through the 
read channel. Then, 



«=i ^ {yu-,yj)<^y-' 



ULiFQiy. 



y: F/(yi, ...,yj) log f ^, i , \ - H{Y,, ...,Yj) 

(a) 



> J ^ Fp{y) log (-^] - H{Y,, 1^2, ... , Yj, S) 
= J E PisMy\s)\og(^]-H{S) 

se{A,C,G.T} 



se{A,C,G,T} 

where (a) comes from the fact that entropy increases by adding a new random variable. Next, we 
need to minimize the obove expression over all distributions in Vk- However, we obtain slightly 
looser bound if we minimize over all distribution in V. Let P* G P be the minimizer of the following 
program 



Ei = minJ y P( s)n (y\s) log (^$^] -2. 



y 
se{A,c,a,T} 



The optimization problem is a linear program and an optimal P* can be obtained by choosing a 
letter s* and put all the probability mass on it, with s* given by: 



■ Y- ^ I M f^iyl'"^ 

s = argmm } ^ 7i"(y|g) log 



and hence Ei becomes 



E, = jJ27riy\s*)\og( 
y \ 



Fqiy), 
7r{y\s)\ 



FQ{y)J 

= JDiniy\s*)\\FQ)-2. 

Hence, since L > 2/ifrenyi, provided that condition ([T]) is satisfied, P(£^i(J)) — )■ as J, G — )■ oo. 



b) ¥{S2{Mi)) — ^ 0; The event Si{J) happens when there is a good ahgnment U containing two 
sets of size Mi each sampled from a single location on the DNA sequence. Without loss of generality, 
we assume that the first Mi i^-mers are sampled from position ki on the DNA sequence and the 
second Mi K-mers are sampled from position ^2 on the DNA sequence with ki ^ k2- Both of these 
sets of K-iaeis are jointly typical wrt Fp^ for some P E Vk- Considering the fact that there are at 
most G^ possible choices for ki and k2 and applying large deviation arguments, we obtain 

.2) <{K+ l)4G'22-i^(mmpeT.^D(F^*^l||F«lF«i)-52(e))^ ^g^ 



for some S2{e). We proceed by computing the KL-divergence: 

D (fI^'^WF^'^F^'^) = D (fI'''\\F^'F^'') + 2D {f^'^\\F^'') . (10) 

Let §1 and §2 be the ML estimate of underlying base s obtained from yi, . . . , i/Mi and Vmi+i, ■ ■ ■ , ?/2Mi, 
respectively. The data processing inequality, c.f. [2j, implies 



By the law of large numbers, P(si = S2) -)■ 1 as Mi —^ 00. Hence D (P(si, S2)||P(si)P(s2)) -> -^(5') 
and D {P{si)\\Q{si)) — >• D {P{s)\\Q{s)) as Mi — )■ 00. Therefore, we can find a lower bound on the 
exponent of the probability of error by solving the following optimization problem: 

^2 = min E Pis)log(^). (12) 

se{A,C,G,T} \^y^) / 

One can show that the preceding optimization problem is minimized with P{s) = r p q! -,2 ■ Therefore, 

E2 = Hj-enyi- (13) 

Since L > 2/iJrenyi, P(^2(Mi)) ^ as Ml, G ^ cx). 

c) ¥(£s(M — VM)) —)■ 0: Fix a particular alignment , say U, with M — vM reads from 
same location. Let us call the DNA subsequence at that location the source sequence. A simple 
large deviations argument says that the probability that the ML estimate is not the same as the 
corresponding base of the source sequence is bounded by 2~^^ for some 7 > 0. Hence, the probability 
that the averaged sequence from U is at distance greater than Ktg from the source sequence is 
bounded by 

For large M, this is approximately 2~^*^'>"^<3. By the union bound, 

PriSsiM - ^/M))) < A2-^^^^"«, (14) 



where A is the number of such alignments. Let us bound A. First, the VM K-mers from other 
locations can at most come from G^^^ different locations. Let us now look at the number of possible 
choices of the M — yM i^-mers coming from the same location. There are G possible such locations. 
It is easy to show that for some constant // > 0, with high probability at each location there are at 
most T] log G K-mers. Hence, the number of choices of the M — \fM i^-mers coming from the same 
location is bounded by G{T]\ogG)^~^^ . Hence, 



A <G'^^^xG'(r7 log 6-)*^-^*^ 

and from (JH]), we get: 

Pr{E^{M - Vm))) < G'v^+i(^iogG')^^-^2-^^^^o. 

Recall that M = /3\ogG, K = LlogG - (LlogG)". A direct calculation shows that F{£3{M - y/M)) 
can be driven to zero if we choose tq = (logG)^4 — ^ 0, for example. 



2) Coverage Condition: To prove the coverage condition of Theorem [31 we need to show that TZ 
contains enough cleaned-up i^-mers covering the DNA sequence. We say a substring of length K is 
covered if there exists a cleaned up read r which is within Hamming distance Kt of the substring. 
Similarly, we say a base is covered if one of the substrings containing the bases is covered by a member 
of ^. 

Let C be the event that there exists a base not covered by members of TZ. Let Cj be the event that 
the ith base is not covered. Clearly, C = U^^Cj. Using the union bound and considering the fact that 
coverage condition is symmetrical for all bases, we obtain 

P(C) < GF{Ck). (15) 

There are K substrings of length K containing the i^th base of the DNA sequence. If none of the 
substrings is covered, then the event Ck happens. 

Let / denote the probability that a given substring of length K in the DNA sequence is not covered 
due to the reads containing it. We claim that 

P(Ci,)</^. (16) 

To see this, instead of considering all the substrings covering the i^th base, we consider only a subset 
of them consisting of j^ substrings with starting positions at {L — K)j for j G {1, . . . , ■j;zj^}- For 
each substring, the probability of not being covered is at most / and there exists no read that can 
contain two of the substrings. Therefore, the probability of missing all the substrings becomes /^-^ 
due to independence of probabilities. The inequality comes from the fact that there are cases where 
the i^th base is covered but not from the chosen subset of substrings. 

We need to obtain an upper bound on /. We look at the interval of length L — K before the 
given substring. The number of reads k with starting location in the interval has Poisson distribution 
with parameter A(L — K). We assume that no other read sampled from outside of the interval can 
assist us in cleaning-up the substring. Clearly, if < A; < M then there is not enough reads and 
hence the substring is not in TZ. For iM < k < {i + 1)M, we partition the reads into i disjoint sets 
each of which having M members. For this case, the substring is not covered if none of the sets is a 
good alignment. For given i, let Vj for j E {I, . . . ,i} be the event that the jth subset is not a good 
alignment. The probability that the substring is not covered is W{n'^--^T>j). We claim that the events 
Vj are independent and therefore, P(n}^iPj) = n5=iP(^j) = (P(^i))*- This is due to the fact that 
the type of the substring irrespective of being close to Q or not is included in the definition of "good 
alignment". In fact, the definition of good alignment is universal and includes all mother substrings 
present in the DNA sequence. 

Using large deviation arguments, the probability that one of the sets fails to pass the good alignment 
test is bounded by 2~^'^^^\ for some ((e) > 0. Therefore, 

; < f (A(L-K))^e-^(^-^) ,|,,^|,,,, 

fc=0 ^' 

We can upper bound it further by / < e-KL-K){i~2 m ) Ugj^g ^-\^q upper bound on /, we obtain 

-\K\ 1-2 ~^r- 



P(C) <Ge V / . (17) 

One can show that if A^ > f ln(G) = A'^ov then P(C) -^ 0. 
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